Thera Bank - Credit Card Customer Churn Prediction

Summary

Thera Bank is concerned about a recent decline in the number of credit card users in its customer base. Retaining customers who have a credit card is important for the bank's revenue. Therefore, the bank is interested in a predictive model that can assess whether a customer who has a credit card is likely to churn.

Purpose: Thera Bank wants a predictive classification model that identifies which customers are more likely to churn, and wants to better understand what factors may lead a customer to give up their credit card with the bank.

Objectives:

Data Dictionary

  1. CLIENTNUM: Client number (unique identifier)
  2. Attrition_Flag: "Attrited Customer": account is closed, "Existing Customer": still retains an account. This is the target variable.
  3. Customer_Age: Age in years.
  4. Gender: Gender of the account holder
  5. Dependent_count: Number of dependents
  6. Education_Level: Includes: Graduate, High School, Unknown, Uneducated, College (meaning a college student), Post-Graduate, Doctorate
  7. Marital_Status: Marital status of the account holder
  8. Income_Category: Annual Income Category of the account holder
  9. Card_Category: Type of card
  10. Months_on_book: Period of relationship with bank
  11. Total_Relationship_Count: Total no. of products held by the customer
  12. Months_Inactive_12_mon: No. of months inactive in the last 12 months
  13. Contacts_Count_12_mon: No. of Contacts between customer and bank in the last 12 months
  14. Credit_Limit: Credit Limit on the CC
  15. Total_Revolving_Bal: Balance that carries over from one month to the next
  16. Avg_Open_To_Buy: Open to Buy refers to the amount left on the credit card to use (Average of last 12 months)
  17. Total_Trans_Amt: Total Transaction Amount (Last 12 Months)
  18. Total_Trans_Ct: Total Transaction Count (Last 12 Months)
  19. Total_Ct_Chng_Q4_Q1: Ratio of the total transaction count in the 4th quarter to the total transaction count in the 1st quarter
  20. Total_Amt_Chng_Q4_Q1: Ratio of the total transaction amount in the 4th quarter to the total transaction amount in the 1st quarter
  21. Avg_Utilization_Ratio: How much of the available credit the customer spent

Data Exploration & Data Cleaning

Initial assessment of the data

Notes

Next Steps

  1. Cast each feature to its correct data type
  2. Set Index as CLIENTNUM
  3. Replace "abc" in Income_Category with NaN for later imputation
  4. Further examination of the data, before data split and modeling
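The first three steps above can be sketched with pandas. This is a minimal sketch, not the notebook's original code: the `clean` helper, the choice of which columns to treat as categorical, and the tiny sample frame (including its income labels) are assumptions made for illustration. Only the column names and the "abc" placeholder come from this report.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows; the values are illustrative, not real customer data.
raw = pd.DataFrame({
    "CLIENTNUM": [1001, 1002, 1003],
    "Attrition_Flag": ["Existing Customer", "Attrited Customer", "Existing Customer"],
    "Income_Category": ["Less than $40K", "abc", "$60K - $80K"],
    "Card_Category": ["Blue", "Silver", "Blue"],
    "Credit_Limit": ["12691.0", "8256.0", "3418.0"],
})

# Assumption: these columns are the ones that should be categorical.
CATEGORICAL_COLUMNS = ["Attrition_Flag", "Income_Category", "Card_Category"]


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps to a copy of df and return it."""
    # Step 2: use the unique client number as the index.
    out = df.set_index("CLIENTNUM")
    # Step 3: turn the "abc" placeholder into a real missing value.
    # Done before the categorical cast so "abc" never becomes a category.
    out["Income_Category"] = out["Income_Category"].replace("abc", np.nan)
    # Step 1: cast each feature to its intended type.
    out["Credit_Limit"] = out["Credit_Limit"].astype(float)
    for col in CATEGORICAL_COLUMNS:
        out[col] = out[col].astype("category")
    return out


cleaned = clean(raw)
print(cleaned.dtypes)
print(cleaned["Income_Category"].isna().sum(), "missing income value(s)")
```

Replacing the placeholder before the categorical cast keeps "abc" out of the category list, so later imputation only sees genuine income bands.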

The NaN values show no obvious pattern that would make them easy to address later.

Notes

Preparing Preliminary Model Dataset

Notes:

There are only a limited number of strong correlations within this preliminary model dataset, which should limit collinearity among the model features.
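One way to check for strong correlations is to list every feature pair whose absolute Pearson correlation exceeds a cutoff. The sketch below assumes a cutoff of 0.8 and uses invented demo values; `strong_correlations` is a hypothetical helper, not a function from the original notebook.

```python
import numpy as np
import pandas as pd


def strong_correlations(df: pd.DataFrame, threshold: float = 0.8) -> pd.Series:
    """Return feature pairs whose absolute correlation exceeds threshold."""
    corr = df.select_dtypes("number").corr().abs()
    # Keep only the upper triangle so each pair appears once
    # and the diagonal (self-correlation) is dropped.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)


# Invented values, chosen so that exactly one pair is strongly correlated.
demo = pd.DataFrame({
    "Credit_Limit": [1000.0, 2000.0, 3000.0, 4000.0, 5000.0],
    "Avg_Open_To_Buy": [900.0, 1500.0, 2900.0, 3200.0, 4800.0],
    "Months_on_book": [36, 12, 48, 24, 30],
})

print(strong_correlations(demo))
```

A short result from this kind of check supports the claim that collinearity is limited. Any pair that does show up is a candidate for dropping one of its two features.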

Classification Models

Primary objective: To build a model that classifies whether or not a customer will churn from the list of customers who have a credit card account. Specifically, the bank is concerned about its services to the customer and what it may be able to change or improve to limit churn.

This means the only variables upon which the model should be built are the customer service related features of the data. Features that describe characteristics of the customer are not relevant. The bank is interested in retaining customers and limiting churn, regardless of the characteristics of the customer.

The features that pertain to customer service or features the bank can influence are:

  1. Attrition_Flag (the target variable): whether a customer churned or not
  2. Card_Category: What type of card the customer holds/held (Blue, Silver, Gold, or Platinum)
  3. Months_on_book: Period of relationship with the bank
  4. Total_Relationship_Count: Total number of products held by the customer
  5. Months_Inactive_12_mon: Number of months inactive in the last 12 months
  6. Contacts_Count_12_mon: Number of contacts between the customer and bank in the last 12 months
  7. Credit_Limit: Credit limit on the customer's credit card
  8. Total_Revolving_Bal: The balance that carries over from one month to the next
  9. Avg_Open_To_Buy: The average amount left on the credit card to use (average is over last 12 months)
  10. Total_Trans_Amt: Total Transaction Amount (last 12 months)
  11. Total_Trans_Ct: Total Transaction Count (last 12 months)
  12. Total_Ct_Chng_Q4_Q1: Ratio of the total transaction count in 4th quarter and the total transaction count in the 1st quarter
  13. Total_Amt_Chng_Q4_Q1: Ratio of the total transaction amount in the 4th quarter and the total transaction amount in the 1st quarter
  14. Avg_Utilization_Ratio: How much available credit the customer spent

The features that pertain to the customer that the bank cannot influence are:

  1. Customer_Age
  2. Gender
  3. Dependent_count
  4. Education_Level
  5. Marital_Status
  6. Income_Category
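The split described by the two lists above can be sketched as a column selection. The column names follow the data dictionary. The `split_features` helper, the placeholder demo frame, and the encoding of "Attrited Customer" as 1 (so that Recall measures the churn class) are assumptions for illustration.

```python
import pandas as pd

TARGET = "Attrition_Flag"

# Features the bank can influence (names as in the data dictionary).
BANK_FEATURES = [
    "Card_Category", "Months_on_book", "Total_Relationship_Count",
    "Months_Inactive_12_mon", "Contacts_Count_12_mon", "Credit_Limit",
    "Total_Revolving_Bal", "Avg_Open_To_Buy", "Total_Trans_Amt",
    "Total_Trans_Ct", "Total_Ct_Chng_Q4_Q1", "Total_Amt_Chng_Q4_Q1",
    "Avg_Utilization_Ratio",
]

# Customer characteristics that are excluded from the model.
CUSTOMER_FEATURES = [
    "Customer_Age", "Gender", "Dependent_count",
    "Education_Level", "Marital_Status", "Income_Category",
]


def split_features(df: pd.DataFrame):
    """Return (X, y) with only bank-related features and a 0/1 target."""
    X = df[BANK_FEATURES]
    # Assumption: churn ("Attrited Customer") is the positive class, encoded as 1.
    y = (df[TARGET] == "Attrited Customer").astype(int)
    return X, y


# Placeholder frame with every column present; the values carry no meaning.
demo = pd.DataFrame({col: [0, 1] for col in BANK_FEATURES + CUSTOMER_FEATURES})
demo[TARGET] = ["Existing Customer", "Attrited Customer"]

X, y = split_features(demo)
print(X.shape, y.tolist())
```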

The key metric for the models will be Recall.

The models will aim to avoid false negatives, which would be failing to identify a customer who does churn. This will lead to higher rates of false positives, which would be incorrectly classifying a customer as churning when they would not.

The bank is concerned with not losing additional customers, so it is most important to identify customers who are likely to churn. For these models, it is assumed that flagging a customer who does not end up churning, which leads the bank to spend resources trying to retain them, costs less than failing to identify a customer who does churn. As a result, the models will optimize Recall more so than Precision or Accuracy.
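To make the trade-off concrete, the sketch below computes Recall and Precision from confusion-matrix counts for two hypothetical models. The counts are invented for illustration and are not results from this project.

```python
def recall(tp: int, fn: int) -> float:
    """Share of actual churners that the model caught."""
    return tp / (tp + fn)


def precision(tp: int, fp: int) -> float:
    """Share of flagged customers that actually churned."""
    return tp / (tp + fp)


# Hypothetical counts on a set containing 100 customers who churned.
# Model A flags aggressively: few missed churners, more false alarms.
model_a = {"tp": 90, "fn": 10, "fp": 40}
# Model B flags cautiously: fewer false alarms, more missed churners.
model_b = {"tp": 70, "fn": 30, "fp": 5}

for name, m in [("A", model_a), ("B", model_b)]:
    r = recall(m["tp"], m["fn"])
    p = precision(m["tp"], m["fp"])
    print(f"Model {name}: recall={r:.2f}, precision={p:.2f}, missed churners={m['fn']}")
```

Under the cost assumption stated above, Model A would be preferred even though its Precision is lower, because it misses far fewer customers who go on to churn.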

Imbalanced Dataset Model training performances (base model)

Balanced Oversampled Dataset Model training performances (oversampled models)

Balanced Undersampled Dataset Model training performances (undersampled models)
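Both balanced datasets are produced by resampling, which is typically applied to the training split only so that evaluation data keeps its real class balance. The sketch below shows plain random resampling with NumPy and pandas. It is a simpler stand-in and not SMOTE, the method the model comparison names for the oversampled dataset; SMOTE synthesizes new minority rows rather than duplicating existing ones (imbalanced-learn provides it as `SMOTE`). The `rebalance` helper and the demo data are hypothetical, and the index is assumed to be unique.

```python
import numpy as np
import pandas as pd


def rebalance(X: pd.DataFrame, y: pd.Series, strategy: str = "under", seed: int = 1):
    """Randomly resample rows so every class has the same count.

    strategy="under" shrinks each class to the smallest class size;
    strategy="over" grows each class to the largest by duplicating rows.
    """
    if strategy not in ("under", "over"):
        raise ValueError("strategy must be 'under' or 'over'")
    rng = np.random.default_rng(seed)
    counts = y.value_counts()
    target_n = counts.min() if strategy == "under" else counts.max()
    parts = []
    for label in counts.index:
        label_idx = y.index[y == label].to_numpy()
        if len(label_idx) > target_n:
            # Undersample: keep a random subset without replacement.
            label_idx = rng.choice(label_idx, size=target_n, replace=False)
        elif len(label_idx) < target_n:
            # Oversample: keep every row and add random duplicates.
            extra = rng.choice(label_idx, size=target_n - len(label_idx), replace=True)
            label_idx = np.concatenate([label_idx, extra])
        parts.append(label_idx)
    keep = np.concatenate(parts)
    return X.loc[keep], y.loc[keep]


# Invented training data: 8 existing customers (0) and 2 churned customers (1).
X_train = pd.DataFrame({"Total_Trans_Ct": [60, 72, 65, 80, 77, 69, 90, 84, 30, 25]})
y_train = pd.Series([0] * 8 + [1] * 2)

X_under, y_under = rebalance(X_train, y_train, "under")
X_over, y_over = rebalance(X_train, y_train, "over")
print(y_under.value_counts().to_dict(), y_over.value_counts().to_dict())
```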

Set functions for model performance assessment
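A performance helper for this section might look like the following minimal sketch, built on scikit-learn's metric functions. The name `classification_scores`, the one-row table layout, and the inclusion of F1 alongside Accuracy, Recall, and Precision are assumptions. The labels are invented and assume churn is encoded as 1.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score


def classification_scores(y_true, y_pred, label: str = "model") -> pd.DataFrame:
    """Return a one-row table of classification metrics for one model."""
    return pd.DataFrame(
        {
            "Accuracy": accuracy_score(y_true, y_pred),
            "Recall": recall_score(y_true, y_pred),
            "Precision": precision_score(y_true, y_pred),
            "F1": f1_score(y_true, y_pred),
        },
        index=[label],
    )


# Invented labels: 1 = churned, 0 = stayed.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

scores = classification_scores(y_true, y_pred)
print(scores)
```

Returning one row per model makes it easy to stack the results with `pd.concat` when comparing the base, oversampled, and undersampled models side by side.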

Gradient Boost Classifier for base dataset

XGBoost Classifier for oversampled dataset

Gradient Boost Classifier for undersampled dataset

Model Comparison

The XGBoost Classifier trained on the SMOTE-oversampled dataset performs best. There is slight overfitting, but this is true of all the models. Although its Recall is not the highest, it still reaches 0.97 while every other metric stays at 0.93 or above.

Model Performance on Test dataset

Pipeline for production of the model
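A production pipeline bundles preprocessing and the classifier into one object, so new data goes through exactly the same steps as the training data. The sketch below is a hedged illustration using scikit-learn. It uses `GradientBoostingClassifier` as a stand-in for the selected XGBoost model, omits the resampling step, and runs on a small invented subset of the bank-related features. The `churned` column (1 = churned) and all values are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Assumption: a small subset of the bank-related features, for illustration only.
numeric_cols = ["Total_Trans_Ct", "Total_Revolving_Bal"]
categorical_cols = ["Card_Category"]

preprocess = ColumnTransformer([
    # Numeric features are passed through unchanged.
    ("num", "passthrough", numeric_cols),
    # Card type is one-hot encoded; unseen card types are ignored at predict time.
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

model = Pipeline([
    ("preprocess", preprocess),
    # Stand-in classifier; the report's selected model is XGBoost.
    ("classifier", GradientBoostingClassifier(random_state=1)),
])

# Invented training rows; "churned" is a hypothetical 0/1 target column.
train = pd.DataFrame({
    "Total_Trans_Ct": [20, 25, 30, 80, 85, 90, 22, 88],
    "Total_Revolving_Bal": [0, 100, 0, 1500, 1800, 1200, 50, 1600],
    "Card_Category": ["Blue", "Blue", "Silver", "Blue", "Gold", "Blue", "Blue", "Silver"],
    "churned": [1, 1, 1, 0, 0, 0, 1, 0],
})

features = train[numeric_cols + categorical_cols]
model.fit(features, train["churned"])
predictions = model.predict(features)
print(predictions)
```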

Business Recommendations